Exploring Red Wine Quality

Udacity Nanodegree Data Analysis with R

By Casimir COMPAORE

Question and dataset

We want to understand which chemical properties influence the quality of red wines.

The data set can be downloaded here and metadata information is available here.

We will use the following R libraries: ggplot2, gridExtra, GGally, stats and memisc.

To read the dataset into a variable reds, I will use R read.csv function.

The variables names of the dataset and their structures are displayed as follow:

reds <- read.csv('wineQualityReds.csv')
names(reds)
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
str(reds)
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

This dataset has 1599 rows or observations in it with 13 variables. Each observation represents an evalution of a red wine quality.

  • The variable X is just a unique identifier for the records;
  • The quality of red wine is ordered, categorical and discrete;
  • The variables free.sulfur.dioxide and total.sulfur.dioxide are discrete;
  • The rest of the variables are continuous.

Exploratory Data Analysis

I will first explore the dataset one variable at a time. Then I will explore two variables, and multiple variables. And finally I will predict the red wine quality based of some chemical properties.

Explore one variable

I will use the qplot function to plot a histogram of each feature. We will also display some descriptive statistics about each feature using summary function. Then we will interpret the distribution of each feature.

We write a custom function called getOutlierExtremes to return weak and strong outliers extremes. The calculation is based on inter-quantile range (IQR).

If we subtract 1.5 x IQR from the first quartile, any data values that are less than this number are considered at least as weak outliers. Similarly if we add 1.5 x IQR to the third quartile, any data values that are greater than this number are considered at least as weak outliers. And applying the same rule based on 3 x IQR, we determine strong outliers.

getOutlierExtremes <- function(feature){
  desc <- summary(feature)
  #Calcuting the interquatile range
  iqr <- IQR(feature)
  #Weak outliers extremes
  weakLow <- desc[2] - (1.5 * iqr)
  weakHigh <- desc[5] + (1.5 * iqr)
  #Strong outliers extremes
  strongLow <- desc[2] - (3 * iqr)
  strongHigh <- desc[5] + (3 * iqr)
  values <- c(iqr, weakLow, strongLow, weakHigh, strongHigh)
  names(values) <- c('IQR', 'Weak Low Ext.','Strong Low Ext.','Weak High Ext.','Strong High Ext.')
  return (values)
}

Feature: fixed.acidity

summary(reds$fixed.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90
getOutlierExtremes(reds$fixed.acidity)
##              IQR    Weak Low Ext.  Strong Low Ext.   Weak High Ext. 
##             2.10             3.95             0.80            12.35 
## Strong High Ext. 
##            15.50

The feature fixed.acidity is not normally-distributed. But the feature is normally-distributed on logarithmic base.

The feature fixed.acidity has some strong outliers.

Feature: volatile.acidity

summary(reds$volatile.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800
getOutlierExtremes(reds$volatile.acidity)
##              IQR    Weak Low Ext.  Strong Low Ext.   Weak High Ext. 
##            0.250            0.015           -0.360            1.015 
## Strong High Ext. 
##            1.390

The feature volatile.acidity is normally-distributed with some strong outliers.

Feature: citric.acid

summary(reds$citric.acid)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
getOutlierExtremes(reds$citric.acid)
##              IQR    Weak Low Ext.  Strong Low Ext.   Weak High Ext. 
##            0.330           -0.405           -0.900            0.915 
## Strong High Ext. 
##            1.410

The feature citric.acid has many 0 values, is not normally-distributed and has some weak outliers.

Feature: residual.suggar

summary(reds$residual.sugar)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500
getOutlierExtremes(reds$residual.sugar)
##              IQR    Weak Low Ext.  Strong Low Ext.   Weak High Ext. 
##             0.70             0.85            -0.20             3.65 
## Strong High Ext. 
##             4.70

The feature residual.sugar has many strong outliers, has a long tail, and is not normally-distributed.

Feature: chlorides

summary(reds$chlorides)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
getOutlierExtremes(reds$chlorides)
##              IQR    Weak Low Ext.  Strong Low Ext.   Weak High Ext. 
##             0.02             0.04             0.01             0.12 
## Strong High Ext. 
##             0.15

The feature chlorides has some weak outliers on the left side, and strong outliers on the right side. It has a long tail, and is not normally-distributed.

Feature: free.sulfur.dioxide

summary(reds$free.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00
getOutlierExtremes(reds$free.sulfur.dioxide)
##              IQR    Weak Low Ext.  Strong Low Ext.   Weak High Ext. 
##               14              -14              -35               42 
## Strong High Ext. 
##               63

The feature free.sulfur.dioxide has strong outliers, and is not normally-distributed.

Feature: total.sulfur.dioxide

summary(reds$total.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00
getOutlierExtremes(reds$total.sulfur.dioxide)
##              IQR    Weak Low Ext.  Strong Low Ext.   Weak High Ext. 
##               40              -38              -98              122 
## Strong High Ext. 
##              182

The feature total.sulfur.dioxide has strong outliers, and is not normally-distributed.

Feature: density

summary(reds$density)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040
getOutlierExtremes(reds$density)
##              IQR    Weak Low Ext.  Strong Low Ext.   Weak High Ext. 
##        0.0022350        0.9922475        0.9888950        1.0011525 
## Strong High Ext. 
##        1.0045050

The feature density is normally-distributed. It has weak outliers on both sides.

Feature: pH

summary(reds$pH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010
getOutlierExtremes(reds$pH)
##              IQR    Weak Low Ext.  Strong Low Ext.   Weak High Ext. 
##            0.190            2.925            2.640            3.685 
## Strong High Ext. 
##            3.970

The feature pH is normally-distributed. It has weak outliers on the left side and strong outliers on the right side.

Feature: sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000
##              IQR    Weak Low Ext.  Strong Low Ext.   Weak High Ext. 
##             0.18             0.28             0.01             1.00 
## Strong High Ext. 
##             1.27

The feature sulphates is not normally-distributed and has strong outliers.

Feature: alcohol

summary(reds$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
getOutlierExtremes(reds$alcohol)
##              IQR    Weak Low Ext.  Strong Low Ext.   Weak High Ext. 
##              1.6              7.1              4.7             13.5 
## Strong High Ext. 
##             15.9

The feature alcohol is not normally-distributed and has weak outliers.

Feature: quality

summary(reds$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
getOutlierExtremes(reds$quality)
##              IQR    Weak Low Ext.  Strong Low Ext.   Weak High Ext. 
##              1.0              3.5              2.0              7.5 
## Strong High Ext. 
##              9.0

The feature quality is normally-distributed. It has weak outliers on both side.

Exploring two variables

I will first explore the target feature (quality) relation with each input feature. Then I will explore the relationships between input features.

Relationship between quality feature and input variables

Good wines seem to have higher fixed.acidity.

Good wines seem to have lower volatile.acidity.

Good red wines seem to have higher citric.acid.

Residual sugar did not seem to have a dramatic impact on the quality the red wines.

Good red wines seem to have lower chlorides.

Sulfur dioxides did not seem to have a dramatic impact on the quality of the red wines.

Good red wines seem to have lower density.

Good red wines seem to have lower pH.

Good red wines seem to have lower sulphates.

Good wines seem to have higher alcohol.

Correlations between features

I will use ggpairs to visualize the correlations between features in the red wines dataset.

I assume existence of correlation between two features where the correlation coefficient is outside the intervale [-0.32, 0.32].

Some observations:

  • Correlation exists between quality and alcohol
  • Correlation exists between quality and volatile.acidity
  • Correlation exists between citric.acid and volatile.acidity
  • Correlation exists between citric.acid and Positive fixed.acidity
  • Correlation exists between free.sulfur.dioxide and total.sulfur.dioxide
  • Correlation exists between density and residual.sugar
  • Correlation exists between density and fixed.acidity
  • Correlation exists between density and citric.acid
  • Correlation exists between pH and density
  • Correlation exists between pH and citric.acid
  • Correlation exists between pH and fixed.acidity
  • Correlation exists between sulphates and chlorides
  • Correlation exists between alcohol and density

Explore multiple variables

I will compute a new categorial variable based on quality called rating. For red wines quality rated less than 5, rating will have the value ‘bad’. For red wines quality rated 5 or 6, rating will have the value ‘average’. And rating will have value for redquality egal 7 and above. This feature will be used to color our plots.

The features volatile.acidity and alcohol are correlated to the target feature quality. I test the independence between the features volatile.acidity and alcohol using pearson correlation method. The density and pH features are normally distributed. But the density feature is strongly correlated to the alcohol feature. The pH feature is neither correlated to the alcohol, neither to the volatile.acidity feature. I thought of creating a categorial variable based on pH, but I abondonned the idea as all pH in the dataset are less than 7, specific to acidic solutions. The rest of the features where not normally distributed even after log10 or sqrt transformations and are not correlated to the quality feature.

Let test the independence of the feature alcohol and the feature volatile.acidity using correlation test with pearson method.

#Compute correlation between features alcohol and volatile.acidity for independence
cor.test(reds$volatile.acidity, reds$alcohol, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  reds$volatile.acidity and reds$alcohol
## t = -8.2546, df = 1597, p-value = 3.155e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2488416 -0.1548020
## sample estimates:
##       cor 
## -0.202288

With 95 percent confidence interval between -0.2488416 and -0.1548020 of correlation coefficient, the features volatile.acidity and alcohol are independent.

I display boxplots of alcohol/quality and volatile.acidity/quality color coded by quality rating.

From these boxplots, the quality of red wine is good when the acohol percentage is high. And the quality of red wine is good at low values of the volatile.acidity.

I display the points plot of volatile.acidity/alcohol features color coded with quality rating feature.

Again the red wines rated good are at low values of volatile.acidity and high values of alcohol.

bads <- subset(reds, reds$rating == 'bad')
cor.test(bads$quality, bads$volatile.acidity, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  bads$quality and bads$volatile.acidity
## t = -2.3049, df = 61, p-value = 0.02459
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.49602400 -0.03794003
## sample estimates:
##        cor 
## -0.2830444
cor.test(bads$quality, bads$alcohol, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  bads$quality and bads$alcohol
## t = 0.97924, df = 61, p-value = 0.3313
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1272830  0.3610418
## sample estimates:
##       cor 
## 0.1244053

A correlation doesn’t exist between quality and volatile.acidity for bad red wines. And a correlation doesn’t exist between quality and alcohol for bad red wines.

averages <- subset(reds, reds$rating == 'average')
cor.test(averages$quality, averages$volatile.acidity, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  averages$quality and averages$volatile.acidity
## t = -8.8607, df = 1317, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2874883 -0.1855937
## sample estimates:
##        cor 
## -0.2371933
cor.test(averages$quality, averages$alcohol, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  averages$quality and averages$alcohol
## t = 14.69, df = 1317, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3278896 0.4206801
## sample estimates:
##       cor 
## 0.3752245

A correlation doesn’t exist between quality and volatile.acidity for average red wines. But a correlation exists between quality and alcohol for average red wines.

goods <- subset(reds, reds$rating == 'good')
cor.test(goods$quality, goods$volatile.acidity, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  goods$quality and goods$volatile.acidity
## t = 0.54322, df = 215, p-value = 0.5875
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.0966390  0.1693712
## sample estimates:
##        cor 
## 0.03702191
cor.test(goods$quality, goods$alcohol, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  goods$quality and goods$alcohol
## t = 2.592, df = 215, p-value = 0.0102
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0418609 0.3002971
## sample estimates:
##       cor 
## 0.1740748

A correlation doesn’t exist between quality and volatile.acidity for good red wines. But a correlation doesn’t exist between quality and alcohol for average red wines.

goodOrBads <- subset(reds, reds$rating != 'average')
cor.test(goodOrBads$quality, goodOrBads$volatile.acidity, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  goodOrBads$quality and goodOrBads$volatile.acidity
## t = -12.876, df = 278, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6797322 -0.5321147
## sample estimates:
##        cor 
## -0.6112117
cor.test(goodOrBads$quality, goodOrBads$alcohol, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  goodOrBads$quality and goodOrBads$alcohol
## t = 9.7474, df = 278, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4118364 0.5871768
## sample estimates:
##       cor 
## 0.5046932

A correlation exists between quality and volatile.acidity for a subset of the data for good or bad red wines. A correlation exists between quality and alcohol for a subset of the data for good or bad red wines.

Quality predictions

I will use a linear model to predict the quality of red wines based on independent input features.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = reds)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = reds)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + pH, data = reds)
## 
## ===============================================
##                      m1        m2        m3    
## -----------------------------------------------
## (Intercept)        1.875***  3.095***  4.269***
##                   (0.175)   (0.184)   (0.369)  
## alcohol            0.361***  0.314***  0.330***
##                   (0.017)   (0.016)   (0.017)  
## volatile.acidity            -1.384*** -1.279***
##                             (0.095)   (0.099)  
## pH                                    -0.422***
##                                       (0.115)  
## -----------------------------------------------
## R-squared             0.227     0.317     0.323
## adj. R-squared        0.226     0.316     0.321
## sigma                 0.710     0.668     0.665
## F                   468.267   370.379   253.328
## p                     0.000     0.000     0.000
## Log-likelihood    -1721.057 -1621.814 -1615.101
## Deviance            805.870   711.796   705.845
## AIC                3448.114  3251.628  3240.202
## BIC                3464.245  3273.136  3267.087
## N                  1599      1599      1599    
## ===============================================

With small R-Squared values, the linear model seems not adapted in predicting the quality of red wines for the entire dataset.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = subset(reds, rating != 
##     "average"))
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = subset(reds, 
##     rating != "average"))
## m3: lm(formula = quality ~ alcohol + volatile.acidity + pH, data = subset(reds, 
##     rating != "average"))
## 
## ===============================================
##                      m1        m2        m3    
## -----------------------------------------------
## (Intercept)       -0.668     2.605***  6.055***
##                   (0.724)   (0.646)   (1.237)  
## alcohol            0.625***  0.475***  0.539***
##                   (0.064)   (0.053)   (0.056)  
## volatile.acidity            -3.319*** -2.834***
##                             (0.274)   (0.308)  
## pH                                    -1.329** 
##                                       (0.409)  
## -----------------------------------------------
## R-squared            0.255     0.513     0.531 
## adj. R-squared       0.252     0.509     0.525 
## sigma                1.201     0.973     0.957 
## F                   95.012   145.657   103.979 
## p                    0.000     0.000     0.000 
## Log-likelihood    -447.573  -388.120  -382.861 
## Deviance           400.961   262.223   252.556 
## AIC                901.146   784.239   775.722 
## BIC                912.050   798.778   793.896 
## N                  280       280       280     
## ===============================================

For a subset of the data reduced to good wines and bad wines only, we note improved values for R-Squared.

Final Plots and Summary

Plot 1: Histogram of Wine Quality

I will display histogram of wine quality to see how quality is distributed in our dataset.

The quality rating with highest number is 6. And we can see that most of wines in our dataset is rated between 5 and 7.

Plot 2: Histogram of Wine alcohol color coded with quality rating

I will display a histogram of wine alcohol. Each bar is proportionally color coded using the quality rating.

The color coded histogram shows that the proportion of good wines is increasing when alcohol percentage is increasing.

Plot 3: Scatter plot of Good Wines and Bad Wines

I will display a scatter plot of good wines and bad wines with volatile acidity on the x-axis and alcohol on the y-axis.

The colored scatter plot shows that good wines are on high alcohol percentage and low volatile acidity density. It also shows at the same time that bad wines are on low alcohol percentage and high volatile acidity density.

Reflection

Based on the EDA, alcohol percentage is one of the most important factor that impacts the quality of red wine. Another important factor impacting the quality of a red wine is volatile Acidity.

We have been able to better predict red wine quality for good wines and bad wines based on Alcohol and volatle acidity features. But for average quality rating, the prediction of the quality is performing poorly. Then we have not been able to determine the input features that impact average quality rating red wines.

In this case for example, where the quality is determined by tasting the wine by experts, there is room for subjectivity. I will recommend using a model like random forest to predict the quality.

Another issue was validating the model. Because we didn’t divide the dataset into model training data and testing data, there is a risk of over fitting the data regardless of the model.